Figure 1 shows the complex set of relationships among the data in the Neotoma Paleoecological Database. The development of a dedicated database to manage these relationships implies that the complexity of the data exceeded the capabilities of traditional data management techniques.
Figure 2A shows the steady increase in datasets in the Neotoma Paleoecological Database.
Figure 2B shows the massive influx of occurrence records into the Global Biodiversity Information Facility (GBIF). Note that digitization of existing records allows GBIF's holdings to precede its organization in 2001.
df$type <- as.factor(df$type)
ggplot(df, aes(x = type)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Dataset Types in Neotoma") +
  xlab("") + ylab("Count")
Figure 3A shows the relative proportion of each of the 23 dataset types in the Neotoma Paleoecological Database.
Figure 3B shows the relative proportion of each of the eight record types in the GBIF dataset.
Figure 4 shows that recent growth in citations of ecological forecasting models far outpaces average citation growth across all STEM fields. SDM citation growth was established from a Web of Science query for (“Ecological Niche Model” OR “Species Distribution Model” OR “Habitat Suitability Model”), and average citation growth was derived from the National Science Board report on Science and Engineering Indicators (2014).
Figure 5 reports the relative proportions of algorithms used in 100 randomly sampled modeling studies. Model instances were classified according to the data-driven/model-driven/Bayesian framework. In total, 203 model instances employing 42 unique algorithms were reviewed across the 100 papers.
Figure 6 demonstrates the cost surface faced by consumers of Google Compute Engine. Rates are in $/hr. Note that, at a given total rate, a relative increase in one computing component trades off against a decrease in another.
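The tradeoff along the cost surface can be illustrated with a simple linear cost model; the coefficients below are hypothetical for illustration only, not Google's actual rates:

```r
# Hypothetical linear cost model: rate ($/hr) as a function of
# vCPU count and memory (GB). Coefficients are illustrative only.
cpu_rate <- 0.03   # $/hr per vCPU (assumed)
mem_rate <- 0.004  # $/hr per GB of memory (assumed)

machine_cost <- function(vcpus, mem_gb) {
  vcpus * cpu_rate + mem_gb * mem_rate
}

# Two configurations on the same iso-cost line: under these rates,
# trading 1 vCPU for 7.5 GB of memory leaves the total rate unchanged.
machine_cost(4, 15)    # 0.18
machine_cost(3, 22.5)  # 0.18
```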
Figures from this point onward are not yet referenced in the thesis.
## [1] "Runtime Model Mean Squared Error: 0.0739031183054413"
## [1] "Runtime Model Percent Variance Explained: 0.954985410583547 %"
## [1] "Accuracy Model Mean Squared Error: 0.000239655748946652"
## [1] "Accuracy Model Percent Variance Explained: 0.87389775011381 %"
## [1] "Runtime Model Mean Squared Error: 0.0126826622863348"
## [1] "Runtime Model Percent Variance Explained: 0.517420789395359 %"
## [1] "Accuracy Model Mean Squared Error: 0.000278476494590642"
## [1] "Accuracy Model Percent Variance Explained: 0.851874011938511 %"
## [1] "Runtime Model Mean Squared Error: 0.0610090355838578"
## [1] "Runtime Model Percent Variance Explained: 0.961824641935391 %"
## [1] "Accuracy Model Mean Squared Error: 0.000277921565501084"
## [1] "Accuracy Model Percent Variance Explained: 0.852169187370878 %"
## [1] "Runtime Model Mean Squared Error: 0.675280667976313"
## [1] "Runtime Model Percent Variance Explained: 0.540011102202936 %"
## [1] "Accuracy Model Mean Squared Error: 0.000277525336257943"
## [1] "Accuracy Model Percent Variance Explained: 0.852379947881295 %"
res <- read.csv("thesis-scripts/data/rf_full.csv")
library(plyr)
res$grp <- interaction(res$method, res$cores, res$trainingExamples, res$numTrees)

# Mean total runtime for each combination of cores, workload size, and method
resSum <- ddply(res, .(cores, trainingExamples, numTrees, method),
                summarize, meanTotalTime = mean(totalTime))
resSum$grp <- as.factor(interaction(resSum$trainingExamples, resSum$numTrees, resSum$cores))
resSplit <- split(resSum, resSum$grp)

parResults <- data.frame(cores = vector('numeric', length = length(resSplit)),
                         trainingExamples = vector('numeric', length = length(resSplit)),
                         numTrees = vector('numeric', length = length(resSplit)),
                         speedup = vector('numeric', length = length(resSplit)),
                         efficiency = vector('numeric', length = length(resSplit)))

for (i in 1:length(resSplit)) {
  item <- resSplit[[i]]
  # Each group holds one parallel and one serial run; rows are assumed
  # ordered parallel first, serial second.
  par <- item[1, ]
  ser <- item[2, ]
  ncores <- par$cores
  Tex <- par$trainingExamples
  nt <- par$numTrees
  # Speedup: serial time over parallel time; efficiency: speedup per core
  speedup <- ser$meanTotalTime / par$meanTotalTime
  eff <- speedup / ncores
  parResults[i, ] <- c(ncores, Tex, nt, speedup, eff)
}
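The loop above applies the standard definitions of parallel speedup and efficiency. As a small worked check with hypothetical runtimes (the 120 s and 40 s figures are made up):

```r
# Speedup S = T_serial / T_parallel; efficiency E = S / cores.
speedup    <- function(t_ser, t_par) t_ser / t_par
efficiency <- function(t_ser, t_par, cores) speedup(t_ser, t_par) / cores

# A serial fit taking 120 s versus a 4-core fit taking 40 s:
speedup(120, 40)        # 3
efficiency(120, 40, 4)  # 0.75, i.e. each core is 75% utilized
```

Efficiency below 1 reflects parallel overhead; perfectly linear scaling would give an efficiency of exactly 1 at any core count.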
## plot speedup
ggplot(parResults, aes(x = cores, y = speedup,
                       group = interaction(trainingExamples, numTrees),
                       col = interaction(trainingExamples, numTrees))) +
  geom_line() + ggtitle("Parallel Speedup of Random Forests")
Figure 19 shows that more expensive workloads benefit more from additional cores than simple modeling routines do.
## and efficiency
ggplot(parResults, aes(x = cores, y = efficiency,
                       group = interaction(trainingExamples, numTrees),
                       col = interaction(trainingExamples, numTrees))) +
  geom_line() + ggtitle("Parallel Efficiency of Random Forests")
Figure 20 shows the diminishing marginal returns of using additional cores. Note that simple workloads, though they benefit from additional cores, drop off steeply in efficiency, while complex workloads decline nearly linearly.